Mango is a popular fruit for local consumption and an export commodity.
Currently, Indonesian mango export at 37.8 M accounts for 0.115% of
world consumption. Pests and diseases are common enemies of mango that
degrade the quality of the yield. Specialized treatments in export
destinations, such as gamma-ray treatment in Australia or hot water treatment in
Korea, demand pest-free and high-quality products. Artificial intelligence
helps to improve mango pest and disease control. This paper compares
deep learning models for mango fruit pest and disease recognition. The
research compares Visual Geometry Group 16 (VGG16), residual neural
network 50 (ResNet50), InceptionResNet-V2, Inception-V3, and DenseNet
architectures to identify pests and diseases on mango fruit. We implement
transfer learning, adopt the pre-trained weight parameters of all those
architectures, and replace the final layer to adjust the output. All the
architectures are re-trained and validated using our dataset. The tropical mango
dataset was collected and labeled by a subject matter expert. The VGG16
model achieves the top validation and testing accuracy at 89% and 90%,
respectively. VGG16 is the shallowest model, with 16 layers, and its testing
time is the shortest in the experiment at about 2 seconds for 130 testing images.

Keywords: CNN

1. INTRODUCTION

Mango is a potential commodity for local consumption and export. Indonesian coastal areas with
high sun exposure all year round are suitable for mango. The regional consumption is 0.5 kg per capita per
year, while exports account for 0.11% of world mango consumption. The international demand is high;
however, due to the requirements of importing countries, it is challenging for Indonesian mango to enter the global
market. Pest- and disease-free requirements in the Japanese, Australian, and Korean markets hinder Indonesian
mango from accessing those markets. Therefore, pest and disease control to ensure the fruit product’s quality
plays a significant role in improving international market acceptance.

The ability of farmers to identify pests and diseases and to handle them properly is a significant factor.
There are two mango farm models: the big professional farm and household mango trees. On big farms, such
as those in East Java and West Java, experienced farmers manage large-scale areas, whereas in households
people grow a few trees around their houses. On a big farm, skilled people with sufficient knowledge carry
out pest and disease control. However, this knowledge is unavailable to ordinary people with a few mango trees around
their houses. An innovative way to disseminate pest and disease control techniques is desirable to
overcome the knowledge gap.

Mobile applications and image recognition offer an opportunity to alleviate the problem of disseminating
pest and disease control knowledge. Smartphone penetration is currently about 56% of Indonesian
citizens. Therefore, it is feasible to deliver knowledge through the smartphone. Image recognition has
matured to detect many visual clues, including pests and diseases on leaves and fruits. With the help of
mobile applications, image recognition has been implemented in many recognition works for plant pests and
diseases, such as [1], [2].

Deep learning has enjoyed tremendous success in classification tasks, particularly for image data. 
The availability of huge labeled datasets such as ImageNet [3] enables researchers to propose, test and 
validate many convolutional neural network (CNN)-based architectures. The transfer learning concept
enables researchers to reuse knowledge learned on another problem set (dataset) in their own
specific problems with smaller dataset sizes [4]. In transfer learning, we reuse the pre-trained weight
parameters. To adjust the network to the new problem, fine-tuning is carried out. The adjustment can be
applied to the entire network or to selected layers only. The task is therefore to identify the best-performing
deep learning architecture and to select which layers need tuning. Researchers have proposed many
well-known deep learning architectures, to name a few: AlexNet [5], AlexNetOWTBn [6], GoogLeNet [7], 
Overfeat [8], visual geometry group network (VGGnet) [9], residual neural network (ResNet) [10], 
InceptionResnet-V2 [11], Inception-V1 [7], Inception-V2 [12], Inception-V3 [12], and Inception-V4 [11], 
and DenseNet [13]. They all have different network structures, number of layers, size of filters and many 
other differences. Those lead to different weight parameters ranging from thousands to hundreds of millions. 
Consequently, they have different computational complexity, training and testing time. 

This research is carried out to extend our previous work on recognizing mango pests on leaves. 
The recognition system for pests on mango leaves has been implemented on a mobile application [14]. 
Following up on suggestions from the evaluation results [15], we extend the capability to the
fruit. This research collects a mango fruit dataset and involves a pest and disease expert to manually
classify the images into five classes. The appearance of pests and diseases of mango fruit in each class is
shown in Figure 1(a) to Figure 1(e). 

With the dataset on hand, we expect to be able to recognize images collected on real farms
through the mobile application. Accurate and high-speed recognition is desired to serve the mobile
application. We aim to find the deep learning architecture with the most acceptable performance, combining
high accuracy and fast recognition. Therefore, we evaluate available architectures in transfer learning mode
and compare their speed and accuracy.


Figure 1. Type of mango fruit pest and disease: (a) Capnodium mangiferae, (b) Cynopterus titthaecheilus, 
(c) Deanolis albizonalis, (d) Pseudaulacaspis cockerelli, and (e) Pseudococcus longispinus 


2. RELATED WORKS 

Plant pest and disease recognition based on visual data has attracted computer vision researchers in the 
past five decades. Researchers employ a support vector machine (SVM) to classify pomegranate fruit images 
into four classes consisting of normal and three infection stages [16]. Support vector machine classifies the 
pomegranate images using some features, which are color coherence vector (CCV), color histogram, and shape. 
They achieve 82% accuracy on the testing dataset. A classification of 82 disease classes on 12 plants has been 
reported by Barbedo et al. [17]. Evaluation of each plant has been carried out independently. The input image 
was segmented using guided active contour (GAC). Histogram similarity to the reference image is ranked as the 
basis of disease recognition. In [18], the image is segmented into the uninfected and infected regions. 
Researchers rely on the hue and co-occurrence matrix of infected leaf images to extract the features. Their work 
achieves 95.71% accuracy using SVM to predict the type of leaf disease. In [19], a comparison between a 
sparse representation-based classification (SRC), SVM, and artificial neural network (ANN) is carried out to 
classify cucumber leaf disease. The SRC outperforms SVM and ANN at 85.7% accuracy. A combination of
dictionary learning and sparse representation has been reported to reach 92.9% accuracy on the Caltech Leaves dataset [20].
Tan et al. [21] recognize a particular cacao fruit disease called cacao black pod rot (BPR)
using k-means clustering and SVM. Their experiments reported 84% accuracy in recognizing BPR.

In the last decade, deep learning gained popularity and showed a breakthrough performance in 
image classification. Computer vision researchers have made use of deep learning algorithms to recognize plant
pests and diseases, such as in [22], [23], which recognize various leaf diseases [24] in controlled conditions.
Promising results have been achieved, at 96.3% precision and 99.35% accuracy. CNN was utilized by
Krizhevsky et al. [5]. The transfer learning concept allows researchers to adopt a model trained on a huge
dataset like ImageNet and adapt it to particular cases with a smaller amount of data. Deng et al. [25] implemented
transfer learning and carried out fine-tuning to achieve a high-performance classifier. Lu et al. [26] classified rice
leaf disease using CNN and reported testing accuracy at 95.48%. They also identified that the stochastic pooling 
layer gave the best results after evaluating three different pooling layers. Wheat is an important food source.
Therefore, in 2017 researchers collected a dataset of seven classes of wheat disease in the wild, called the wheat disease
database 2017 (WDD2017). In [27], they reported that CNN-based networks with no fully connected layers (FCL)
are superior to the original CNN in classifying WDD2017. Too et al. [28] reported a
comparison of some well-known CNN architectures in classifying the PlantVillage dataset [24]. They evaluated 
VGGnet [9], Inception V4 [11], DenseNet [13], and ResNet [10]. According to their experiments, the DenseNet 
outperforms the rest of the architectures at 99.75% accuracy. A mobile-based wheat
leaf disease recognition system was introduced in [2]. It used a ResNet architecture with 50 layers to carry out the classification task,
and it showed promising classification performance at 96% accuracy. Ferentinos [29] classified
leaf diseases using five CNN architectures, namely AlexNet [5], AlexNetOWTBn [6],
GoogLeNet [7], Overfeat [8], VGGnet [9]. They reported that VGGnet outperforms the rest of the 
architectures, and it reaches 99.48% accuracy. A study of key factors impacting deep learning performance has 
been reported in [30]. They found that image background, image capture conditions, symptom representation, 
covariate shift, symptom segmentation, symptom variations, simultaneous disease, and symptom similarity are 
factors that affect deep learning performance. In [31], independent processing of each color channel of the input
was introduced. The results were combined as the input of the FCL. They evaluated the three-channel CNN,
GoogLeNet, and LeNet-5 [32] to classify cucumber leaf disease and found that the three-channel CNN achieved
the best accuracy at 91.15%. In [33], apple trunk disease recognition was carried out using VGGnet. They
compared VGGnet with focal loss and softmax loss functions. The VGGnet using the focal loss function performed better,
reaching 94.5% accuracy, a 2% margin over the VGGnet with the softmax loss function. In [34],
VGGnet was used to recognize mildew diseases and reached 95% accuracy. Barbedo [35] reported that a
classification task of 14 leaf diseases attained 94% accuracy with the GoogLeNet architecture.

Despite the popularity of mango, there are a limited number of studies on mango pest and disease
recognition. The authors of [36] reported 48.95% accuracy on a recognition task of four diseases and a normal leaf
using SVM. They extract several features from the gray-level co-occurrence matrix (GLCM),
such as contrast, correlation, energy, homogeneity, mean, standard-deviation, entropy, root mean square 
(RMS), variance, smoothness, kurtosis, and skewness. Singh et al. [37] used CNN to recognize anthracnose 
disease and reported 97.13% accuracy. The obvious visual cue was responsible for the high achievement of 
this task. In our previous work, we classified mango pests [1] on affected leaf images. We collected the 
dataset [38] from mango farms in Indonesia and organized them into sixteen classes. We implemented 
augmentation techniques such as noise addition, blur, contrast, and affine transformation (i.e., rotation and 
translation in Cartesian coordinate) in order to improve the performance of VGGnet classifier. According to 
our experiments, augmentation successfully improved the accuracy by a 4% margin after using augmented 
images in the training phase. 


3. RESEARCH METHOD 

This is interdisciplinary research that involves a mango pest expert and computer scientists.
The dataset was collected around Indonesia. The image collection was labeled by the mango pest expert. Once the
dataset is labeled, data size standardization is carried out.


3.1. Dataset 

The dataset consists of 653 labeled mango fruit images covering five pest and disease classes. In
real cases, a particular mango fruit usually carries only one pest; as reflected in the dataset, each fruit image has a
single pest label. The dataset is divided into training, validation, and testing sets at approximately 60%, 20%, and 20%,
respectively, as presented in Table 1.


Table 1. Number of images in dataset

Pest/disease name           Train  Valid  Test  Total
capnodium_mangiferae        38     13     13    64
cynopterus_titthaecheilus   63     21     21    105
deanolis_albizonalis        86     28     28    142
pseudaulacaspis_cockerelli  159    52     52    263
pseudococcus_longispinus    47     16     16    79
Total                       393    130    130   653
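
The split proportions can be checked directly from the counts in Table 1. The sketch below only re-adds the published numbers; the helper names are hypothetical, and nothing here reproduces how the authors actually assigned images to each split.

```python
# Image counts per class copied from Table 1: (train, valid, test).
TABLE_1 = {
    "capnodium_mangiferae": (38, 13, 13),
    "cynopterus_titthaecheilus": (63, 21, 21),
    "deanolis_albizonalis": (86, 28, 28),
    "pseudaulacaspis_cockerelli": (159, 52, 52),
    "pseudococcus_longispinus": (47, 16, 16),
}


def split_totals(table):
    """Sum the train/valid/test columns over all classes."""
    train = sum(row[0] for row in table.values())
    valid = sum(row[1] for row in table.values())
    test = sum(row[2] for row in table.values())
    return train, valid, test


def split_fractions(table):
    """Return the realized share of each split in the whole dataset."""
    totals = split_totals(table)
    n = sum(totals)
    return tuple(t / n for t in totals)


def class_shares(table):
    """Return each class's share of the whole dataset (class imbalance)."""
    n = sum(sum(row) for row in table.values())
    return {name: sum(row) / n for name, row in table.items()}


print(split_totals(TABLE_1))
print([round(f, 3) for f in split_fractions(TABLE_1)])
for name, share in class_shares(TABLE_1).items():
    print(f"{name}: {share:.3f}")
```

The realized split is close to, but not exactly, 60/20/20, and the class shares show that pseudaulacaspis_cockerelli accounts for roughly 40% of all images.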


3.2. Deep learning image classifier 

We implement a convolutional neural network using five well-known architectures:
VGG16, ResNet50, InceptionResNet-V2, Inception-V3, and DenseNet. They are discussed in the following
subsections.


3.2.1. VGG16 model 

We apply a CNN architecture named VGG16, which was used to win the ImageNet competition in 2014.
Figure 2 presents the detailed architecture. This research adopts transfer learning as weight
initialization. The VGG16 network has already been trained on the ImageNet dataset, so the initial weights of
our network are copies of the ImageNet pre-trained model. The final layers of the original VGG16 architecture
are replaced. All the remaining layers are frozen to retain the weights trained on ImageNet, while
the training set is used to train only the replaced last layer. By limiting training to the weights of the last layer of the
network, we can shorten the training time without sacrificing classifier accuracy. Consequently, we only
train the last layer, which is the fully connected layer (FCL) with a softmax function.
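
The idea of freezing the backbone and training only a softmax head can be illustrated with a toy sketch. This is not the authors' code and not VGG16: `frozen_features` is a hypothetical stand-in for the frozen pre-trained layers, the three-class data points are invented, and plain gradient descent is used only to keep the example dependency-free.

```python
import math
import random


def frozen_features(x):
    """Stand-in for the frozen pre-trained backbone (hypothetical, never updated)."""
    return [x[0], x[1], x[0] * x[1], 1.0]  # last entry acts as a bias input


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(head, x):
    """Forward pass: frozen features, then the trainable linear layer and softmax."""
    f = frozen_features(x)
    logits = [sum(w * v for w, v in zip(row, f)) for row in head]
    return softmax(logits)


def train_head(samples, n_classes, lr=0.1, epochs=300, seed=0):
    """Fit only the head weights with per-sample gradient descent on cross-entropy."""
    rng = random.Random(seed)
    n_feat = len(frozen_features(samples[0][0]))
    head = [[rng.uniform(-0.01, 0.01) for _ in range(n_feat)]
            for _ in range(n_classes)]
    for _ in range(epochs):
        for x, y in samples:
            f = frozen_features(x)
            p = predict_proba(head, x)
            for c in range(n_classes):
                # Gradient of cross-entropy w.r.t. logit c is p_c - 1[c == y].
                g = p[c] - (1.0 if c == y else 0.0)
                for j in range(n_feat):
                    head[c][j] -= lr * g * f[j]
    return head


def predict(head, x):
    """Return the index of the most probable class."""
    p = predict_proba(head, x)
    return p.index(max(p))


# Invented, well-separated toy data: (input, class label).
SAMPLES = [
    ((-2.0, 0.0), 0), ((-2.2, 0.3), 0), ((-1.8, -0.2), 0),
    ((2.0, 0.0), 1), ((2.1, 0.2), 1), ((1.9, -0.3), 1),
    ((0.0, 2.0), 2), ((0.2, 2.1), 2), ((-0.1, 1.8), 2),
]

head = train_head(SAMPLES, n_classes=3)
print([predict(head, x) for x, _ in SAMPLES])
```

Only `head` changes during training; the feature function is evaluated but never modified, which is the property that keeps head-only training cheap.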


[Figure: VGG16 block diagram. Input 224 x 224 x 3; stacked 3x3 convolution blocks followed by 2x2 max-pooling; R: relu, S: softmax; final layer sizes 256 and 5]


Figure 2. VGG16 architectures


VGG16 is a convolutional neural network with 16 layers, and its training process is computationally
heavy. However, executing the trained VGG16 model is fast on a personal
computer because a CNN is a parameterized model: by applying the saved weights to a new
sample, the model can predict its class. The training process is executed on a dedicated deep learning server;
therefore, the computational load of training is not a significant concern. This research develops a high-accuracy model to
serve client applications that recognize the pest from fruit images.


3.2.2. ResNet50 model 

He et al. [10] introduced the ResNet model, which served as the basis for their entries to
the ImageNet large-scale visual recognition challenge (ILSVRC) 2015 and Microsoft Common Objects in
Context (COCO) 2015 classification challenges. Their model was ranked first in ImageNet classification with
an error rate of 3.57%. The failure of multiple non-linear layers to learn identity mappings and the degradation
problem spurred the development of the deep ResNet.

ResNet is a network-in-network (NIN) architecture that is built on numerous stacked
residual units. These residual units are the network’s building blocks and serve as
the foundation for the ResNet architecture [10]. Convolution, pooling, and layer stacking are used to create the
residual units. The architecture is comparable to that of the VGG network [9], which consists of 3x3 filters,
although ResNet is approximately eight times deeper. This is because global average pooling is used instead of
fully connected layers. ResNet was further updated to improve accuracy by changing the residual module to use
identity mappings. As in [10], ResNet models of 50, 101, and 152 layers were built and loaded with pre-trained
weights from ImageNet. Finally, a custom softmax layer was constructed for the purpose of identifying plant
diseases. We use ResNet50 in this research. The architecture is shown in Figure 3.


Figure 3. ResNet50 architecture
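
A residual unit adds its input back to the output of its stacked layers, so an identity mapping is available as soon as the layers output zero. The sketch below is a simplified illustration under stated assumptions: it uses small fully connected layers instead of convolutions, omits batch normalization and the activation after the addition, and all function names are hypothetical.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def plain_unit(x, w1, w2):
    """Two stacked layers without a skip connection: y = F(x)."""
    return linear(w2, relu(linear(w1, x)))


def residual_unit(x, w1, w2):
    """Same two layers with a skip connection: y = x + F(x)."""
    fx = plain_unit(x, w1, w2)
    return [a + b for a, b in zip(x, fx)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, -2.0]
print(plain_unit(x, zeros, zeros))     # the signal is lost
print(residual_unit(x, zeros, zeros))  # the input passes through unchanged
```

With all-zero weights the plain unit erases its input, while the residual unit returns it unchanged, which is why stacking many such units does not force each one to learn an identity mapping from scratch.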


3.2.3. InceptionResNet-V2 model 

InceptionResNet-V2 [11] is a CNN architecture developed from InceptionResNet-V1.
InceptionResNet-V2 is used for the transfer learning classification process and is built on the Inception
architecture combined with residual connections. InceptionResNet-V2 was developed by replacing the filter
part of the Inception process [39], [40].

Figure 4 shows the stages of classification used to detect mango images. The input
is an image that is converted into a frame to which image processing techniques are applied. Features that
match the specific characteristics of the mango fruit are then extracted from the
image [41]. InceptionResNet-V2 is used as a high-level
feature extractor that provides image content that can help identify pests on mango fruit [42].


[Figure: InceptionResNet-V2 block diagram. Input 299 x 299 x 3; Inception-ResNet-A, Reduction-A, Inception-ResNet-B, Reduction-B, global avg-pool; repeat counts x10 and x20; final layer sizes 256 and 5]


Figure 4. InceptionResNet-V2 architecture


3.2.4. Inception-V3 model

Inception has four versions, namely Inception-V1 [7], Inception-V2 [12], Inception-V3 [12], and
Inception-V4 [11]. The Inception model applies several filters in parallel within a layer. The results of the filters
are combined by channel concatenation before entering the next layer [43]. There are 48 layers in
Inception-V3, which is deeper than its predecessor deep convolutional neural network architecture,
Inception-V1, also known as GoogLeNet [7].

The Inception-V3 network structure uses the convolution kernel splitting method to split large
convolutions into smaller ones. For example, a 3x3 convolution is divided into a 3x1 convolution and a
1x3 convolution. Through this separation, the number of parameters is reduced; hence, network training
can be accelerated while spatial features can be extracted more effectively [44].
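
The parameter saving from kernel splitting follows from simple counting. The sketch below is illustrative only: it ignores biases, assumes the same number of channels before, between, and after the two factorized convolutions, and the channel count of 64 is an arbitrary example value rather than a figure from Inception-V3.

```python
def conv_weights(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights in one convolution layer, ignoring biases."""
    return kernel_h * kernel_w * in_channels * out_channels


def factorized_weights(k, channels):
    """Weights of a k x 1 convolution followed by a 1 x k convolution."""
    return (conv_weights(k, 1, channels, channels)
            + conv_weights(1, k, channels, channels))


channels = 64  # arbitrary example value
full = conv_weights(3, 3, channels, channels)
split = factorized_weights(3, channels)
print(full, split, 1 - split / full)
```

For a 3x3 kernel the factorized pair needs 6 weights per channel pair instead of 9, a reduction of one third regardless of the channel count under these assumptions.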

This study uses the Inception-V3 model as implemented in
TensorFlow to extract and classify mango fruit image features [45], [46]. The model is used
to develop an image classifier that classifies images of pests on mangoes based on three features:
texture, shape, and color. The architecture is shown in Figure 5.


3.2.5. DenseNet model 

In their paper, Huang et al. [13] introduced a densely connected convolutional network architecture. 
To ensure maximum information flow between layers in the network, all layers are connected directly with 
each other in a feed-forward manner. For each layer, the feature-maps of all preceding layers are used as 
inputs, and its own feature maps are used as inputs into all subsequent layers. DenseNets alleviate the
vanishing-gradient problem and have a substantially reduced number of parameters [13]. For this
task of plant disease identification, a DenseNet model with 121 layers as described in [13] was created.
Additionally, the model was loaded with pre-trained weights from ImageNet. Finally, another fully-connected
model with our own customized softmax on the top layer was created.
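
Because every layer in a dense block receives the feature maps of all preceding layers, the input width grows linearly with depth. The sketch below counts input channels and direct connections for one block; the starting width of 64, growth rate of 32, and block length of 6 are assumed example values, and the function names are hypothetical.

```python
def dense_block_inputs(k0, growth_rate, num_layers):
    """Input channel count seen by each layer of one dense block.

    Layer i (0-based) receives the k0 block inputs plus the growth_rate
    feature maps produced by each of the i preceding layers.
    """
    return [k0 + growth_rate * i for i in range(num_layers)]


def dense_block_output(k0, growth_rate, num_layers):
    """Channel count after concatenating the outputs of all layers."""
    return k0 + growth_rate * num_layers


def dense_connections(num_layers):
    """Direct connections among num_layers densely connected layers: L(L+1)/2."""
    return num_layers * (num_layers + 1) // 2


print(dense_block_inputs(64, 32, 6))
print(dense_block_output(64, 32, 6))
print(dense_connections(6))
```

Each layer adds only `growth_rate` new feature maps, which is one reason a densely connected network can stay small in parameter count despite its depth.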


Figure 5. Inception-V3 architecture


3.3. Fine tuning 

This research adopts the transfer learning concept, where all the architectures above have been pre-trained
on the ImageNet dataset for 1000 target classes. Transfer learning aims to use the knowledge gained while training
on one type of problem to train on another related task or domain [4]. In order to adjust the network to fit
our classification problem, the head of the network is replaced so that the number of target classes is five.

Fine-tuning is a transfer learning technique. In our research, fine-tuning is carried out on the entire
network in order to re-train all the weight parameters. Fine-tuning starts from an initial condition
in which the weight parameters have already been trained on another problem. With a new training set, an
adjustment is needed for all layers or for particular layers. The researcher can select which layers need to be re-trained
and freeze the others. Even though training is needed to adjust to the new problem set,
the initial knowledge can significantly cut the learning effort compared to training from scratch [23]. More
importantly, in many cases it is more accurate than models trained from scratch.

In this research, CNN models pre-trained on the ImageNet dataset were fine-tuned to identify and classify
five categories of fruit pests and diseases. ImageNet is a huge collection of 1.2 million
labeled images in 1000 categories. The CNN architectures with the new head are re-trained with a small
number of mango fruit images.


4. RESULTS AND DISCUSSION 

The research aims to identify the deep learning architecture with acceptable performance and high 
accuracy. Testing time and testing accuracy are two main considerations in the problem sets as the algorithms 
are designed to serve mobile client applications with multiple requests concurrently. Training time is 
important, but it was not the main consideration, since the training will only take place in the modelling task. 
It is worth mentioning that the server in these experiments is the prototype of the server that we use to serve 
the running pest visual recognition mobile application. 


4.1. Experiment setup and parameters 
The experiments in this research are conducted on a computer server with the following specification:
- Processor: i9 9900K.
- Memory: 64 GB.
- GPU: NVIDIA TITAN V, memory 12 GB, 640 Tensor Cores, 5120 CUDA Cores.
This research did not optimize the hyperparameters. The only parameter set explicitly is the learning rate, at 0.00005. The rest
of the parameters are left at their default values. The optimization algorithm is Adam.
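
To make the effect of the 0.00005 learning rate concrete, the sketch below applies one Adam update to a single scalar parameter. The values beta1 = 0.9, beta2 = 0.999, and eps = 1e-8 are assumptions taken from the commonly cited Adam defaults; the text does not state which framework defaults were used, and `adam_step` is a hypothetical helper, not a library call.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.00005,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step index."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First step from zero moments: the move is close to lr, whatever the gradient scale.
for grad in (0.01, 3.0, 500.0):
    new_param, _, _ = adam_step(1.0, grad, m=0.0, v=0.0, t=1)
    print(grad, 1.0 - new_param)
```

On the first step the bias-corrected ratio is approximately the sign of the gradient, so each weight moves by about the learning rate, which illustrates how a small rate keeps pre-trained weights close to their starting values.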


4.2. Evaluation metrics 

The accuracy and loss on the training and validation sets were recorded. The training time and validation
time for the entire training and validation sets were also a focus of this research. Finally, the testing
accuracy and average testing time of each model were recorded to identify the most feasible model
for deployment. The execution times might not be exactly reproducible
in other research due to different experiment settings; however, the comparison between deep
learning models should show a similar trend.


4.2.1. VGG16

In VGG16, the training loss decreases until epoch 50. However, the validation loss starts to
fluctuate at epoch 5, which indicates that the model starts to overfit the training data. The minimum training and
validation losses are 0.0458 and 0.3108, respectively. The validation accuracy starts to stagnate at epoch 5.
The maximum training and validation accuracies are 0.9821 and 0.90769, respectively. The accuracy and loss
curves are shown in Figure 6.


[Plot: ValAcc, TrainAcc, ValLoss, and Train Loss curves; axes: Accuracy, Loss]


Figure 6. VGG16 training and validation accuracy and loss
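
The point at which validation loss stops improving can be located programmatically from a recorded loss history. The sketch below is one possible way to do so and is not part of the reported experiments: the loss values are invented for illustration (they are not read from Figure 6), and the patience rule is an assumed stopping criterion.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return val_losses.index(min(val_losses)) + 1


def early_stop_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs; None means it never does.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


# Invented validation-loss history, for illustration only.
history = [0.90, 0.60, 0.45, 0.36, 0.31, 0.33, 0.32, 0.35, 0.34, 0.36]
print(best_epoch(history), early_stop_epoch(history, patience=3))
```

Keeping the weights from the best epoch rather than the last one is a common way to limit the effect of the overfitting described for each model.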


4.2.2. ResNet50 

In ResNet50, the training loss decreases until epoch 50. However, the validation loss starts to
increase at epoch 8, which indicates that the model starts to overfit the training data. The minimum training and
validation losses are 0.0422 and 0.3885, respectively. The validation accuracy starts to stagnate at epoch 12.
The maximum training and validation accuracies are 0.9821 and 0.8923, respectively. The accuracy and loss
curves are shown in Figure 7.


4.2.3. InceptionResNet-V2 

In InceptionResNet-V2, the training loss decreases until epoch 50. However, the validation loss
starts to increase at epoch 20, which indicates that the model starts to overfit the training data. The minimum
training and validation losses are 0.0537 and 0.4207, respectively. The validation accuracy starts to stagnate
at epoch 9. The maximum training and validation accuracies are 0.9796 and 0.8846, respectively. The accuracy
and loss curves are shown in Figure 8.


Figure 7. ResNet50 training and validation accuracy and loss

Figure 8. InceptionResNet-V2 training and validation accuracy and loss


4.2.4. Inception-V3

In Inception-V3, the training loss decreases until epoch 50. However, the validation loss starts to
flatten at epoch 12, which indicates that the model starts to overfit the training data. The minimum training and
validation losses are 0.0531 and 0.5618, respectively. The validation accuracy starts to stagnate at epoch 10.
The maximum training and validation accuracies are 0.9821 and 0.8692, respectively. The accuracy and loss
curves are shown in Figure 9.


Figure 10. DenseNet training and validation accuracy and loss


4.2.5. DenseNet121

In DenseNet121, the training loss decreases until epoch 50. However, the validation loss starts to
increase at epoch 12, which indicates that the model starts to overfit the training data. The minimum training and
validation losses are 0.0529 and 0.4579, respectively. The validation accuracy starts to stagnate at epoch 20.
The maximum training and validation accuracies are 0.9847 and 0.8846, respectively. The accuracy and loss
curves are shown in Figure 10.


4.3. Discussion 

According to Table 2, the VGG-16 model has the fewest layers, with only 16. In terms of parameter size,
DenseNet121 has about 9.4 million parameters, the smallest number among the five models.
Consequently, the model size of DenseNet121 is also the smallest, at 113 MB. In terms of training accuracy,
VGG-16, ResNet50, and Inception-V3 all reach 0.9821, while DenseNet121 reaches 0.9847; the smallest training loss
is achieved by the ResNet50 model. The best validation accuracy is achieved by VGG-16 and ResNet50.
However, VGG-16 has a lower validation loss than ResNet50. A lower validation loss suggests that the model
can better predict unseen data. This is supported by the testing accuracy of VGG-16, which exceeds all competitors
at 0.9076. In addition, the training and testing times of VGG-16 are the shortest among the
models, at 141.72 s and 2.15 s, respectively. The VGG-16 architecture therefore
shows its superiority over the rest in terms of testing accuracy and time. VGG16 is the shallowest
architecture, with 16 layers, and its model size (253 MB) is the second smallest after DenseNet121. Based on
the results, we are confident that the VGG-16 model is suitable for fruit disease detection.


Table 2. Accuracy, loss, and execution time after 50 epochs

Model              Layers  Parameter   Model      Training  Training  Validation  Validation  Testing   Training  Testing
                           size        size (MB)  accuracy  loss      accuracy    loss        accuracy  time (s)  time (s)
VGG-16             16      21,138,757  253        0.9821    0.0458    0.8923      0.3108      0.9076    141.72    2.15
ResNet50           50      28,307,845  340        0.9821    0.0422    0.8923      0.3885      0.8923    173.09    15.67
InceptionResNetV2  164     55,911,141  673        0.9796    0.0537    0.8846      0.4207      0.8846    554.12    49.88
InceptionV3        48      23,901,447  287        0.9821    0.0531    0.8692      0.5618      0.8923    178.75    19.46
DenseNet121        121     9,398,341   113        0.9847    0.0529    0.8846      0.4579      0.8923    274.29    30.9
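
The testing times in Table 2 cover all 130 test images, so the per-image latency and the number of correctly classified images follow by simple arithmetic. The sketch below only re-derives these quantities from the published figures; the helper names are hypothetical.

```python
N_TEST_IMAGES = 130  # size of the test set in Table 1

# Testing time (s) and testing accuracy copied from Table 2.
TEST_TIME_S = {
    "VGG-16": 2.15,
    "ResNet50": 15.67,
    "InceptionResNetV2": 49.88,
    "InceptionV3": 19.46,
    "DenseNet121": 30.9,
}
TEST_ACCURACY = {
    "VGG-16": 0.9076,
    "ResNet50": 0.8923,
    "InceptionResNetV2": 0.8846,
    "InceptionV3": 0.8923,
    "DenseNet121": 0.8923,
}


def per_image_seconds(total_seconds, n_images=N_TEST_IMAGES):
    """Average recognition time for a single image."""
    return total_seconds / n_images


def slowdown_vs_fastest(times):
    """Testing time of each model relative to the fastest model."""
    fastest = min(times.values())
    return {name: t / fastest for name, t in times.items()}


def correct_images(accuracy, n_images=N_TEST_IMAGES):
    """Number of correctly classified test images implied by the accuracy."""
    return round(accuracy * n_images)


for name in TEST_TIME_S:
    print(name,
          round(per_image_seconds(TEST_TIME_S[name]), 4),
          round(slowdown_vs_fastest(TEST_TIME_S)[name], 1),
          correct_images(TEST_ACCURACY[name]))
```

For VGG-16 this gives roughly 0.0165 s per image and 118 of 130 images classified correctly, while the next fastest model, ResNet50, takes about seven times longer per image.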


In Kusrini et al. [1], we implemented a VGG16 classifier for recognizing pests on leaf images. Since
it is not easy to collect a large number of images of infected mango leaves, the
original samples were augmented in order to improve the classification performance. We achieved 71% accuracy on
the testing data in that experiment, which is due to the cluttered background and the visual similarities among many
different classes, as mentioned by Barbedo [30]. In this paper, we expand the classification task to the mango
fruit dataset and, as can be seen in Table 2, the best result is achieved by VGG16 even without data
augmentation. The fruit dataset is as small as the leaf dataset, but the background
is much tidier and the differences between classes are much more obvious. This leads to good performance among all the evaluated
architectures.

We also found that the deeper architectures do not yield better accuracy. The
problem is simple, with only five target classes, low interclass similarity, and a tidy background, which enables a simpler
network to capture the pattern of the training data very well. Deeper networks tend to overfit, as
indicated by the high training accuracy combined with lower validation and testing accuracy.

With the small dataset, we achieve 90.76% accuracy on the test set, and it would be quite useful
to bring this result to the current mobile pest detection application. This is because the
implementation is not purely automatic; we keep a human in the loop. In the future, along with the
deployment of this recognition task and human feedback, we expect to capture a richer dataset from the
field. With more data and humans in the loop, both as users and as crowd labeling experts,
we can retrain the classifier and gain a better recognition rate.

The model built in this research will be applied in a mobile application. To reduce the possibility of 
errors due to different input data, before the process of classifying pests on fruit, the application will identify 
whether the image entered is a fruit image or something else. The identification model will use the results of 
previous research [47] by adding fruit data as one of the classes. 


5. CONCLUSION 

VGG16 can effectively recognize mango fruit pests with 90% accuracy and a recognition time of about 0.0165
seconds per image. The speed and accuracy are acceptable for a mobile pest recognition
system. The rest of the architectures show lower accuracy and longer recognition times; therefore, for the
currently available dataset, we conclude that the implementation of VGG16 is acceptable.